Add CUDA Target, Runtime, and Kernel CI Support #1
Merged
Test Results: 3 318 tests, 3 318 ✅, 1h 54m 25s ⏱️. Results for commit 49f5205.
sunnycase added a commit that referenced this pull request on Jul 2, 2026
This adds end-to-end CUDA support for the nncase NTT path and native runtime. It introduces a CUDA target/module compiler, CUDA runtime module loading and execution support, CUDA-aware NTT runtime primitives, CUDA kernel tests, and a dedicated Linux CUDA CI job for running those tests separately from the regular CPU/macOS compiler jobs.

Motivation:
- Enable nncase to compile NTT-generated kernels for CUDA and execute them through the native runtime, instead of stopping at code generation.
- Keep CPU and CUDA generated modules on compatible runtime/operator ABI boundaries while allowing CUDA-specific device entry points and launch behavior.
- Catch CUDA-specific regressions in CI without making ordinary Linux/macOS compiler jobs depend on GPU availability.
- Fix issues uncovered while enabling CUDA tests, including CUDA toolkit discovery, device-callable scalar helpers, generated module ABI handling, and reduce-axis normalization in the NTT vectorization/lowering path.

Implementation:
- Added CUDA target plumbing in Nncase.Modules.NTT, including CUDATarget, CUDAModuleCompiler, target abstraction cleanup, and CUDA-aware C/CMake generation.
- Added native CUDA runtime support with CUDA runtime module/function classes, loader integration, runtime CMake wiring, and ENABLE_CUDA_RUNTIME gating.
- Added NTT CUDA runtime support for topology, remote tensors, distributed operations, vector ops, profiling, and CUDA runtime entry points.
- Updated NTT kernels and runtime utilities so generated code can compile for both CPU and CUDA, including device-callable scalar conversions/operators for half and related scalar types.
- Normalized negative reduce axes during NTT vectorization/lowering instead of IR construction, preserving IR semantics while fixing CUDA reduce vectorization cases.
- Added CUDA kernel test coverage through UnitTestCUDAKernels and enabled it in CI with a dedicated test-x86_64-linux-cuda job.
- Kept the regular compiler test job excluding UnitTestCUDAKernels so CPU-only Linux/macOS jobs remain independent from CUDA runtime availability.

Validation:
- git diff --check
- YAML parsing for .github/workflows/compiler-build.yml
- dotnet build modules/Nncase.Modules.NTT/Nncase.Modules.NTT.csproj -c Release --no-restore
- Rebuilt the native runtime locally with clang, CUDA 12.8, and ENABLE_CUDA_RUNTIME=ON.
- Verified the installed native runtime exposes CUDA runtime support and links against CUDA runtime libraries.
- Verified generated CUDA modules compile locally with clang++ and CUDA 12.8 for the reduce/vectorization repro cases.
- dotnet test src/Nncase.Tests/Nncase.Tests.csproj -c Release --no-build --no-restore --filter "FullyQualifiedName~Nncase.Tests.TargetTest.UnitTestCUDAKernels.TestVectorizeReduce": 8/8 passed locally.

Limitations:
- Full UnitTestCUDAKernels execution requires a Linux runner with an NVIDIA GPU, CUDA toolkit, nvcc, clang/clang++, and the labels self-hosted, linux, x64, cuda.
- The dedicated CUDA CI job currently uses CUDA architecture 120, matching the local validation environment.
- Future CUDA follow-up work is tracked in #2.
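
The reduce-axis normalization described in the commit message can be illustrated with a small standalone sketch. It is not the nncase implementation: the function name normalize_reduce_axes and its signature are hypothetical, and the sketch only shows the usual convention of mapping an axis in [-rank, rank) to its non-negative equivalent at lowering time, leaving the IR-level axes untouched.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of nncase): returns a copy of `axes` in which
// every negative axis is rewritten to its non-negative equivalent for a
// tensor of the given rank. The input vector is left unmodified, mirroring
// the idea of normalizing during lowering rather than at IR construction.
std::vector<std::int64_t> normalize_reduce_axes(
    const std::vector<std::int64_t>& axes, std::int64_t rank) {
    assert(rank >= 0);
    std::vector<std::int64_t> normalized;
    normalized.reserve(axes.size());
    for (std::int64_t axis : axes) {
        // Valid axes lie in [-rank, rank); anything else is a caller error.
        if (axis < -rank || axis >= rank) {
            throw std::out_of_range("reduce axis out of range for tensor rank");
        }
        normalized.push_back(axis < 0 ? axis + rank : axis);
    }
    return normalized;
}
```

Under these assumptions, calling normalize_reduce_axes({-1, 0}, 3) yields the axes 2 and 0, so a later vectorization step would only ever see non-negative axes.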
Summary
This PR adds end-to-end CUDA support for the nncase NTT path and native runtime. It introduces a CUDA target/module compiler, CUDA runtime module loading and execution support, CUDA-aware NTT runtime primitives, CUDA kernel tests, and a dedicated Linux CUDA CI job for running those tests separately from the regular CPU/macOS compiler jobs.
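Motivation
- Enable nncase to compile NTT-generated kernels for CUDA and execute them through the native runtime, instead of stopping at code generation.
- Keep CPU and CUDA generated modules on compatible runtime/operator ABI boundaries while allowing CUDA-specific device entry points and launch behavior.
- Catch CUDA-specific regressions in CI without making ordinary Linux/macOS compiler jobs depend on GPU availability.
- Fix issues uncovered while enabling CUDA tests, including CUDA toolkit discovery, device-callable scalar helpers, generated module ABI handling, and reduce-axis normalization in the NTT vectorization/lowering path.
Implementation
- Added CUDA target plumbing in Nncase.Modules.NTT, including CUDATarget, CUDAModuleCompiler, target abstraction cleanup, and CUDA-aware C/CMake generation.
- Added native CUDA runtime support with CUDA runtime module/function classes, loader integration, runtime CMake wiring, and ENABLE_CUDA_RUNTIME gating.
- Added NTT CUDA runtime support for topology, remote tensors, distributed operations, vector ops, profiling, and CUDA runtime entry points.
- Updated NTT kernels and runtime utilities so generated code can compile for both CPU and CUDA, including device-callable scalar conversions/operators for half and related scalar types.
- Normalized negative reduce axes during NTT vectorization/lowering instead of IR construction, preserving IR semantics while fixing CUDA reduce vectorization cases.
- Added CUDA kernel test coverage through UnitTestCUDAKernels and enabled it in CI with a dedicated test-x86_64-linux-cuda job.
- Kept the regular compiler test job excluding UnitTestCUDAKernels so CPU-only Linux/macOS jobs remain independent from CUDA runtime availability.
Validation
- git diff --check
- YAML parsing for .github/workflows/compiler-build.yml
- dotnet build modules/Nncase.Modules.NTT/Nncase.Modules.NTT.csproj -c Release --no-restore
- Rebuilt the native runtime locally with clang, CUDA 12.8, and ENABLE_CUDA_RUNTIME=ON.
- Verified the installed native runtime exposes CUDA runtime support and links against CUDA runtime libraries.
- Verified generated CUDA modules compile locally with clang++ and CUDA 12.8 for the reduce/vectorization repro cases.
- dotnet test src/Nncase.Tests/Nncase.Tests.csproj -c Release --no-build --no-restore --filter "FullyQualifiedName~Nncase.Tests.TargetTest.UnitTestCUDAKernels.TestVectorizeReduce": 8/8 passed locally.
Limitations
- Full UnitTestCUDAKernels execution requires a Linux runner with an NVIDIA GPU, CUDA toolkit, nvcc, clang/clang++, and the labels self-hosted, linux, x64, cuda. GitHub-hosted CPU runners cannot execute these runtime tests.
- The dedicated CUDA CI job currently uses CUDA architecture 120, matching the local validation environment.
- […] UnitTestCUDAKernels; CUDA runtime tests are expected to run only in the dedicated CUDA job.
- […] HF_HOME cache/configuration in CI and should be revisited separately from the CUDA runtime work.
Backlog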
Future CUDA follow-up work is tracked in #2.
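
As an illustration of the "device-callable scalar conversions/operators" item, the sketch below shows one common way to make a scalar helper compile for both CPU and CUDA. It is only a sketch under stated assumptions: the macro SKETCH_HOST_DEVICE and the type toy_bf16 are hypothetical names, not nncase or NTT identifiers, and the truncating bfloat16-style conversion is chosen for brevity rather than taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical macro: expands to CUDA execution-space qualifiers when the
// translation unit is compiled as CUDA, and to nothing for a plain host build.
#if defined(__CUDACC__)
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// Toy 16-bit scalar that stores the upper half of an IEEE-754 float
// (bfloat16 layout). Conversion truncates instead of rounding to nearest.
struct toy_bf16 {
    std::uint16_t bits;

    SKETCH_HOST_DEVICE static toy_bf16 from_float(float value) {
        std::uint32_t raw;
        std::memcpy(&raw, &value, sizeof(raw));
        return toy_bf16{static_cast<std::uint16_t>(raw >> 16)};
    }

    SKETCH_HOST_DEVICE float to_float() const {
        std::uint32_t raw = static_cast<std::uint32_t>(bits) << 16;
        float value;
        std::memcpy(&value, &raw, sizeof(value));
        return value;
    }
};

// The operator carries the same qualifier so it stays callable from kernels.
SKETCH_HOST_DEVICE inline toy_bf16 operator+(toy_bf16 a, toy_bf16 b) {
    return toy_bf16::from_float(a.to_float() + b.to_float());
}
```

Routing every qualifier through a single macro means the same header can be included by CPU-only builds, where the qualifiers do not exist, and by CUDA builds, where helpers without them cannot be called from device code.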